Consider these proposed objectives for what eighth grade students should know about
working with data: 

1. Given any of a variety of common graphs or tables, including bar graphs, time 
series, and two-by-two tables, students will be able to read off the value of a 
specified case and to give the number (or percent) of cases of a specified type. 

2. Given a number of values or a graph, students will be able to determine from 
them statistics including the mean, median, mode, and range. 

Probably all of us would agree that eighth grade students should be proficient in these 
two sets of skills. But few of us would likely regard these skills as being even close to 
comprising all of what we would expect of them. Despite this, we appear to have 
reached a national consensus that instruction in data analysis up to the middle school 
should be primarily concerned with these very basic computation and graph reading 
skills. The primary evidence for this consensus is the fact that nearly 80% of the items 
on high-stakes tests released by various states target these two objectives. 

In this paper we report on our ongoing efforts to identify and assess key ideas in data 
analysis (or statistics) that we maintain should be at the focus of middle school 
instruction. It was in the hopes of locating items that we could use to assess some of 
these more complex objectives that we searched the collection of items released by the 
National Assessment of Educational Progress (NAEP) and the various states. Below, 
we first describe in more detail the nature of items being used on large-scale state 
assessments. We then offer some of our views on what we should be teaching and 
present some items that we are designing to tap these ideas.

What the High-Stakes Tests Are Assessing

To get a sense of the nature of items currently being used to assess competencies in 
data analysis, we searched on the sites of states that conduct large-scale assessments of 
their students. We confined our search to items targeted to grades 6-10. Many states
have items available from several different years, and in such cases we looked only at
those items from the most recent year available. We also looked at the items
administered by NAEP for grade eight in 1990, 1992, and 1996.

We located 264 items from the high-stakes tests of 41 states that fit the above criteria. 
Roughly 41% of these items were described as sample items, 10% were released practice 
items, and the remaining 49% were released items that had appeared on the most recent 
administration of the state's assessment. We also located 10 items on data analysis from 
past administrations of NAEP. This gave us a total of 274 items. 

We coded the 274 items as either "encode/decode" or "other." In broad terms, we
included in the encode/decode category items that asked students to convert raw data
into a statistic or display (table or graph), or to do the reverse: to determine from a
data display or a statistic the corresponding data values or frequencies. Figure 1 is an
example of an item that tests the ability to compute a mean from a set of values, and
Figure 2 tests the ability to determine from a case value bar graph the case with the
largest value.
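
To make the "encode" direction concrete, the following minimal sketch reduces the five prices from the Figure 1 item to the kinds of summary statistics named in objective 2. It is only an illustration of the skill being tested, not part of our coding procedure; the variable names are our own.

```python
from statistics import mean, median

# Prices (in dollars) taken from the Figure 1 item.
prices = [80.00, 95.00, 60.00, 90.00, 85.00]

# "Encoding": reduce the raw values to summary statistics.
summary = {
    "mean": mean(prices),
    "median": median(prices),
    "range": max(prices) - min(prices),
}

for name, value in summary.items():
    print(f"{name}: ${value:.2f}")
```

An item at this level asks for nothing beyond one of these numbers; it does not ask what the number says about the stores.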



Insert Figures 1 and 2 about here 



In terms of the distinctions among data skills suggested by Curcio (1987), the
encode/decode category corresponds roughly to her descriptor "reading the data." As
we attempted to communicate in the two objectives in our introduction, these items
probe students' knowledge of conventions for representing data graphically and of
summarizing data with various measures such as frequencies, relative frequencies,
means, ranges, etc.

Of the 274 items we analyzed, 78% target encode/decode skills. Table 1
lists the states we obtained items from and shows the breakdown of items of each type.
The states are ordered in the table according to the percentage of encode/decode items.
We obtained only a few items from many of the states, and for these states the
percentages of encode/decode items are questionable indicators of the pattern of items
on their assessments. Accordingly, we divided the states into two groups in Table 1:
those from which we located more than 5 items (top) vs. 5 or fewer items (bottom). We
obtained 10 or more items for Minnesota, Mississippi, Ohio, Texas, and Georgia, and in
each of these states, over 90% of the items target skills at the encode/decode level. In
contrast, we obtained 12 items from Kansas, of which fewer than half were
directed at encoding/decoding skills. Five of the ten items from NAEP were of the
encode/decode variety.
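
The percentages in Table 1 follow directly from the two counts for each state. As a rough sketch, the snippet below recomputes them for the six states named above, using counts copied from Table 1; the function name is our own and is introduced only for illustration.

```python
# Counts (encode/decode, other) copied from a few rows of Table 1.
counts = {
    "Minnesota": (15, 0),
    "Mississippi": (18, 1),
    "Ohio": (13, 1),
    "Texas": (11, 1),
    "Georgia": (10, 1),
    "Kansas": (5, 7),
}


def percent_encode_decode(encode_decode: int, other: int) -> float:
    """Share of a state's items coded encode/decode, as a percentage."""
    return round(100 * encode_decode / (encode_decode + other), 1)


percentages = {state: percent_encode_decode(*pair) for state, pair in counts.items()}

for state, pct in percentages.items():
    total = sum(counts[state])
    print(f"{state:<12} n={total:>2}  {pct:5.1f}% encode/decode")
```

With so few items per state, a single item coded differently moves these percentages by several points, which is why we treat the state-level figures with caution.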

For the most part, items that we coded as "other" tended to assess higher level skills,
such as ideas related to sampling, scaling, predicting, choosing between using different
averages in particular situations, and making decisions or recommendations from the
data and justifying these. However, this category also included items that in our
opinion do not involve data analysis. Figures 3 and 4 show two examples. The item
in Figure 3 asks students to make a recommendation that satisfies several mathematical
constraints. The reason we think it was considered as involving data analysis is that
the information the students were to consider was presented in a table which the
students had to use correctly if they were to extract the relevant information.



Insert Figure 3 about here



Similarly, the item in Figure 4 asks students to locate a value on a linear function, and 
our guess is that it was considered data analysis by virtue of its being a graph. In 
looking at a number of items like these and the objectives they supposedly assess, our 
sense is that some test developers are interpreting "data analysis" more generally as the 
organization of information, where the information that is organized need not involve
data in the statistical sense. 



Insert Figure 4 about here 



Higher-Level Objectives And Items 

Based on our analyses, it appears that our current high-stakes assessments are virtually
ignoring all but the most rudimentary skills involved in data analysis. We assume that
at least part of this neglect results from a lack of clarity about what the
higher-level objectives in data analysis might be or about how one might assess these
objectives using formats appropriate for wide-scale testing. Accordingly, we offer
below our view on larger objectives and ways to assess them.

We do not attempt to enumerate what we think the objectives in data analysis at the 
middle school should be. Rather, we focus on three overarching ideas and related 
skills that we believe should be near the top of such a list. These are: 

1. comparing two groups, 

2. judging the relationship between two attributes, and 

3. the understanding that as a sample grows, measures of group characteristics
from that sample become more stable and thus more informative. 

The first two objectives are described in the National Council of Teachers of
Mathematics (NCTM) Principles and Standards for School Mathematics (2000) for the
middle school under the heading "Develop and evaluate inferences and predictions that
are based on data." The Standards suggest that:

In collecting and representing data, students should be driven by a desire to
answer questions on the basis of data. In the process, they should make
observations, inferences, and conjectures and develop new questions (pp. 251-252).

The Standards break this down into the more specific expectations that in these grades 
students should: 

• use observations about differences between two or more samples to make 
conjectures about the populations from which the samples were taken; 

• make conjectures about possible relationships between two characteristics of 
a sample... (p. 248) 

It is interesting to note that these expectations speak of making "inferences," and yet
formal inferential techniques (e.g., t-tests, Chi-square, confidence intervals) are not part
of the Standards for the middle school. Our own view is that we ought to be helping
elementary and middle school students develop ideas that support making inferences
from samples and that are precursors to formal tests of inference. The
understanding outlined in objective 3 above is such an idea.

We should add that our main reason for developing these items is to use them to help
gauge the effectiveness of instructional materials we are developing, and in this effort
we are collaborating with several other statistics education projects. Our initial hope
was that we could use items that had appeared on the high-stakes tests for our purposes,
and we began developing our own only after we found so few that targeted higher-level
objectives. But we also hope as part of this effort to nudge the more general discussion of
what the key ideas of data analysis are and how we can help our students develop
them. Developing items that we can all agree assess those ideas not only gives us a
means of gauging our progress; the step of translating our vision into assessment items
is a critical part of the process of conceptualizing those objectives.

Group Comparison 

Making comparisons between two groups is perhaps the most fundamental and widely 
employed technique in statistics. One of the hopes in integrating data analysis into the 
K-12 curricula is that our citizens will become more facile with interpreting the 
bombardment of claims they encounter about one option being better than another. 

The item in Figure 5 is from the 2001 Kansas Curricular Standards for Mathematics. The 
first part of the question asks the student to construct boxplots from stem and leaf plots. 
We would regard this first part as involving decoding and encoding. The second part 
asks for a judgment about the two groups. This is one of the few items we found in our
search that asks for a group comparison, though the question as worded ("what
inferences could be made...") is so open ended that we imagine it would prove difficult
to score.
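
The encoding step in the first part of this item, going from the ordered values in a stem and leaf plot to the five numbers a boxplot displays, can be sketched as follows. The page counts here are hypothetical rather than read from Figure 5, and the quartile rule (medians of the lower and upper halves) is one common classroom convention that we assume for illustration; other conventions give slightly different quartiles.

```python
from statistics import median


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using medians of the two halves."""
    ordered = sorted(values)
    n = len(ordered)
    lower_half = ordered[: n // 2]        # values below the middle position
    upper_half = ordered[(n + 1) // 2:]   # values above the middle position
    return (
        ordered[0],
        median(lower_half),
        median(ordered),
        median(upper_half),
        ordered[-1],
    )


# Hypothetical numbers of pages read by 12 students (not the Figure 5 data).
pages = [42, 45, 58, 58, 64, 69, 71, 72, 84, 87, 90, 93]
summary = five_number_summary(pages)
print(summary)
```

Producing these five numbers is still encoding. The second part of the item, deciding what two such summaries say about the classes, is the part that calls for a judgment.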



Insert Figure 5 about here 



A notable feature of the problem is that it provides information in a key that might help 
a student unfamiliar with stem and leaf plots to decipher them. Furthermore, we 
assume that students who could not construct a box plot could still demonstrate 
proficiency in comparing the two groups on the second part of the item by interpreting 
the stem and leaf plot (but we did not have access to a scoring protocol to verify this). 
For the items we have been developing to assess higher level objectives, we have tried
to use plots that research suggests are relatively easy for students to decode (Bright &
Friel, 1998; Feldman, Konold, & Coulter, 2000). Our intention is to separate the question
of whether a student can decode a particular plot from the question of whether he or
she can perform a more complex analysis based on it. Among other things, this allows
us to use the item to assess student reasoning before instruction.

Figure 6 shows a problem we are developing to assess the ability to formulate a valid 
comparison between two groups, in this case to decide which of two headache remedies 
works faster. We have adapted this item from a protocol we developed and used in a 
series of clinical interviews (see Konold, in preparation). 



Insert Figure 6 about here 



Acceptable responses to this item would include claims that the new drug is better
because, e.g., "The average time to relief for people taking the new drug appears to be
less" or "The majority of those taking the new drug got relief in less than 1 hour
compared to a small minority of those taking the old drug." Both of these responses
entail using a measurement for each group that is derived from all the data in that
group (an average or a percentage). Based on research with similar problems, we know
that many students working with data and displays like these employ comparison
methods that use only small subsets of the data (Gal, Rothschild, & Wagner, 1990;
Konold, Pollatsek, Well, & Gagnon, 1997; Watson & Moritz, 1999). These methods
include comparing numbers of cases in the two groups:

1. in small slices ("The new drug is better because with it there were about 10 
people who got relief in 50 minutes compared to only about 3 people with the 
old drug.") 

2. in one of the extremes ("The old drug is better because the two people who 
got the fastest relief used the old drug."), or 

3. relative to a cut point ("The new drug is better because about 20 or 30 of that 
group took over 80 minutes compared to only 4 taking the new drug."). 



We use different sample sizes in the two groups so that methods based on comparing 
numbers rather than percentage of cases will be problematic. We also include extreme 
values that contradict the overall trend such that the group with the lower mean has the 
highest two values and the group with the higher mean has the lowest two values. 
Otherwise the spread and shape of the two groups are relatively similar so that 
comparing groups based on their averages is reasonable (see Konold & Pollatsek, 2002). 
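
A small numeric sketch shows why unequal group sizes make count-based comparisons problematic. The group sizes (100 and 150) come from the Figure 6 item; the counts of people getting relief within one hour are hypothetical values chosen for illustration, not values read from the plot.

```python
# Group sizes from the Figure 6 item.
group_size = {"new": 100, "old": 150}

# Hypothetical counts of people who got relief within one hour.
relief_within_hour = {"new": 60, "old": 60}


def proportion(group: str) -> float:
    """Fraction of a group that got relief within one hour."""
    return relief_within_hour[group] / group_size[group]


same_count = relief_within_hour["new"] == relief_within_hour["old"]
print("Equal counts in both groups:", same_count)
for group in group_size:
    print(f"{group} formula: {proportion(group):.0%} got relief within one hour")
```

A student comparing raw counts would call the two formulas equivalent here, whereas the proportions (60% versus 40%) favor the new formula.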

Judging Relationships 

Judging whether and how two attributes are related is another critical skill included
among the objectives for middle school students. Figure 7 shows one of the items we
are developing to assess this capability. In this case, the student is given a verbal
summary of a study's findings and must critique four possible plots with respect to this
summary. We expect that after field testing the item we will revise it to ask simply that
students select the option that most closely corresponds to the verbal summary.



Insert Figure 7 about here 



One major difference between this item and the item presented in Figure 4 is that here
students must not only read a point, but attend to the trend. Furthermore, the trend is
not linear and it is a noisy one, with plenty of variability. It is the latter feature that
makes this a statistical problem rather than a purely mathematical one. Because of all the
exceptions to the trend, it is not so straightforward to perceive and describe it.
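
The difference between a function and a noisy trend can be sketched by simulation. The starting coverage (55%), the leveling off at about 120 seconds, and the 60 participants come from the Figure 7 item; the plateau level, the range of brushing times, and the amount of noise are assumptions made purely for illustration.

```python
import random


def typical_plaque(seconds: float) -> float:
    """Hypothetical trend: coverage (%) falls until 120 s, then levels off."""
    start, floor, plateau_at = 55.0, 15.0, 120.0  # the 15% floor is an assumption
    if seconds >= plateau_at:
        return floor
    return start - (start - floor) * seconds / plateau_at


def simulate(n_people: int = 60, noise_sd: float = 6.0, seed: int = 1):
    """Return (seconds, plaque %) pairs scattered around the typical trend."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_people):
        seconds = rng.uniform(0, 240)  # assumed range of brushing times
        plaque = typical_plaque(seconds) + rng.gauss(0, noise_sd)
        data.append((seconds, max(0.0, plaque)))  # coverage cannot be negative
    return data


data = simulate()
print(data[:3])
```

In data like these, individual people who brushed longer can still end up with more plaque than people who brushed less, so no single pair of points reveals the trend. Students have to look at the cloud as a whole.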

One of the shortcomings of this particular item is that we know that the scatterplot is
not a particularly easy representation for students to decode (see Batanero, Estepa, &
Godino, 1997; Cobb, McClain, & Gravemeijer, in press; Konold & Higgins, 2003; Noss,
Pozzi, & Hoyles, 1999). We are developing other items that make use of alternative
representations that students appear to be able to interpret with much less difficulty
(see Konold, 2002).







Stability of Measures from Large Samples 



One of the fundamental ideas in statistics is that as an appropriate sample gets larger, 
various properties of its parent population become more visible. These properties 
include the location of the mean, median, measures of spread such as the standard 
deviation and interquartile range, as well as the overall shape of the distribution. This 
insight provides the basis for trusting that samples give us useful information about 
populations and thus for making inferences about the population from the sample. Our 
own sense is that the middle school curricula could do more to help develop this 
insight. Lehrer and his colleagues have developed and tested a number of classroom 
activities which demonstrate that even elementary grade students are quite capable of 
understanding and applying this concept (e.g., Lehrer, Schauble, Strom, & Pligge,
2001).
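
The stabilizing effect of sample size can be sketched with a short simulation. The population used here (normally distributed values with mean 12 and standard deviation 4, loosely thought of as backpack weights in pounds) is entirely hypothetical, as are the sample sizes and the number of repetitions.

```python
import random
import statistics


def spread_of_sample_means(n: int, repetitions: int = 200, seed: int = 0) -> float:
    """Standard deviation of the means of many random samples of size n."""
    rng = random.Random(seed)
    means = [
        statistics.fmean([rng.gauss(12.0, 4.0) for _ in range(n)])  # hypothetical population
        for _ in range(repetitions)
    ]
    return statistics.stdev(means)


spreads = {n: spread_of_sample_means(n) for n in (10, 40, 160)}
for n, spread in spreads.items():
    print(f"sample size {n:>3}: sample means vary with SD of about {spread:.2f}")
```

The means of the larger samples cluster more tightly around the population mean, which is the sense in which a measure from a larger sample is more stable and more informative.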



Insert Figure 8 about here 



We designed the item in Figure 8 to assess the idea that random samples of the same
size will basically resemble one another. The item presents a sample of the weights of
backpacks of 40 randomly chosen individuals. The student must pick from among four
alternative stacked dot plots the plot most likely to result from adding another 40 data
points to the sample. From classroom field tests with a similar situation (see Konold &
Pollatsek, 2002, pp. 283-384), we know that some students will maintain that in a new
sample anything is possible and that therefore they have no expectations about the
outcome. (We are adding to this item the option that there is no reason to favor one
graph over another.) Other students argue (correctly) that as the sample grows in size,
the range will tend to grow larger. But these students also often expect that the
distribution will in general become more flattened (option a). Option b is perhaps too
subtle, but we included it to capture the thinking of those who believe that the second
sample would be identical to the first. Option c is consistent with the expectation that
many hold that as a sample gets larger, all aspects of it also get larger (including, for
example, the mean).



Conclusions 

Earlier, we speculated that the reason current high-stakes tests focus almost exclusively
on low level capabilities in data analysis is that test developers either have a different
view of what data analysis is or have concluded that higher level skills are difficult to
assess in item formats appropriate for standardized tests. But it may also be that the test
developers have consciously decided to assess only the most rudimentary skills,
perhaps because they are fearful that most students would be incapable of any more
than that. Whatever the reasons for the status quo, the makeup of current large-scale
assessments in our opinion is serving to hinder the development of statistical literacy in
our students. Once in place, these items as a collection serve to communicate to all the
stakeholders what the real objectives are. It becomes increasingly difficult in this
environment to develop and test new approaches and objectives, because teachers,
feeling the pressure to prepare students to do well on these assessments, are
understandably loath to devote class time to topics or skills that are not directly covered
on them.

Appendix: Tables and Figures



Figure 1. Grade 10 released test item from Louisiana's GEE 21 (Graduation
Exit Examination for the 21st Century), July 2002.



Roy compared the price of a tape player at 5 stores. The prices at the 
different stores were $80.00, $95.00, $60.00, $90.00, and $85.00. What 
was the average (mean) price of the tape players? 

a. $415.00 

b. $410.00 

c. $85.00 
d. $82.00



Figure 2. Grade 8 practice test item from Minnesota's BST (Basic Skills 
Test), 1998. 



Use the bar graph below to answer question 59.

[Bar graph: "Highest Altitudes for each Continent." Axes: Continents; Highest Altitudes
(In Thousands of Feet), scaled from 0 to 32 in steps of 2.]

59. The highest altitude in the world is located on what continent?

A. South America

B. Australia

C. Africa

D. Asia







Figure 3. Grade 8 released test item from the Missouri Assessment 
Program (MAP), 2002. 

10. For the Shallwood Middle School Fun Night next month, 600 students voted for
their favorite activity. The results and the costs associated with each activity are
shown in the table below.

FAVORITE ACTIVITIES FOR FUN NIGHT

Favorite Activity    Percentage Who Voted    Cost (in dollars)
Playing music        21                      250
Movies               10                      30
Volleyball           15                      10
Board games          2                       60
Arcade games         25                      200
Miniature golf       9                       50
Free-throw shot      15                      15
Face painting        3                       10

Fun Night will have at least three activities, but no more than six activities. The
committee can spend up to $300 on all the activities.

In the box below write a recommendation to the committee about which activities
to select for Fun Night. To ensure good attendance, at least 50% of the total
number of students must have voted for the combination of activities that you
choose. Be sure to provide the work to support your recommendation with
percentages and costs from the table.




Figure 4. Grade 8 released test item from the Florida Comprehensive
Assessment Test (FCAT), 2003.




The graph below represents the equivalent weights, in pounds, of people on the planets
Venus and Earth.

[Graph: "EQUIVALENT WEIGHTS." Axis label: Weight on Earth (in pounds).]

What is the weight of a person on Earth if his or her equivalent weight on Venus is
[illegible] pounds?

A. [illegible] pounds

B. [illegible] pounds

C. [illegible] pounds

D. [illegible] pounds

Figure 5. Grade 7 sample test item from the Kansas Curricular Standards
for Mathematics, 2001.



G7.S4.B2.A1.#1

The stem and leaf plot below represents the number of pages read by each student this week
in the 3rd hour and 5th hour English classes.

3rd hour | stem | 5th hour
  995532 |  9   | 0113
  877531 |  8   | 47
     533 |  7   | 1222456
       0 |  6   | 49
      74 |  5   | 88
         |  4   | 25
       9 |  3   | 7

Key:   3 |  7   |     represents 73
         |  4   | 2   represents 42

Compare the box-and-whiskers plots of the data from these two classes. What
inferences could be made about the number of pages read in each class?

Figure 6. Item under development, Tinkerplots project, 2003.

A drug company developed a new formula for their headache medication. To test the
effectiveness of this new formula, they gave it to 100 people with headaches and timed
how many minutes it took for the patient to report that the headache had gone. They
compared the result from this test to previous results from 150 patients using the old
formula under the exact same conditions. The results from both these clinical trials are
shown below.

[Graph: Time to relief (minutes), on an axis from 0 to 120, plotted separately by Group.]



Based on these results, write a short summary of what these data say about the 
effectiveness of the new treatment compared to the old. The summary is for the drug 
company who wants to decide whether to start marketing the new formula. 



Figure 7. Item under development, Tinkerplots project, 2003. 



A toothpaste company did a study of how much brushing was
required to remove most of the plaque that covers teeth. They studied
60 people. Each person brushed as they normally would, but was
told after a certain number of seconds to stop brushing. An
experimenter then determined the amount of plaque remaining on
that person's teeth.

The researchers reported the following findings: 

1) In the morning before brushing, plaque typically covers about 55% of the 
surface area of a person's teeth. 

2) Up until about 120 seconds, the longer people brush the more plaque they 
remove. 

3) After about 120 seconds, additional brushing does not appear to remove 
more plaque. 

Below are four possible graphs of the data they collected. For each graph, say 
whether it agrees with these findings or not. If a graph doesn't agree with the 
findings, briefly explain why.

Figure 8. Item under development, Tinkerplots project, 2003.

As part of a campaign to get students to reduce the weight of their backpacks, middle
school students set up a weighing station inside the main door of the school. They
randomly selected students as they arrived at school, weighed their packs, and posted
this information on a graph displayed on the wall. Data from the first 40 students they
sampled are shown in the graph below.

They randomly sampled another 40 students and added their data to the graph on
the wall. Below are 4 possible graphs, with the new data shown in a different
color. Which of the graphs do you think is most likely to be the actual graph they
got after sampling a total of 80 students?

Table 1. Distribution of item types in high-stakes assessment exams.



State             Encode/Decode    Other        % Encode/Decode

>5 items
Illinois          6                0            100.0
Minnesota         15               0            100.0
Mississippi       18               1            94.7
Ohio              13               1            92.9
Texas             11               1            91.7
Georgia           10               1            90.9
Utah              8                1            88.9
Virginia          8                1            88.9
Arkansas          7                1            87.5
Massachusetts     11               2            84.6
Florida           10               2            83.3
Connecticut       13               4            76.5
Tennessee         5                2            71.4
Michigan          7                3            70.0
New Hampshire     4                2            66.7
North Carolina    6                3            66.7
Kentucky          5                3            62.5
Washington        3                3            50.0
Colorado          3                4            42.9
South Carolina    3                4            42.9
Kansas            5                7            41.7

<6 items
Alaska            2                0            100.0
California        1                0            100.0
Delaware          1                0            100.0
Indiana           3                0            100.0
Louisiana         2                0            100.0
New York          1                0            100.0
Oregon            [missing]        0            100.0
Pennsylvania      [missing]        [missing]    100.0
Wyoming           [missing]        0            100.0
Idaho             4                1            80.0
New Jersey        [missing]        1            80.0
Wisconsin         [missing]        1            80.0
Maryland          [missing]        2            60.0
Maine             [missing]        [missing]    33.3
Hawaii            [missing]        [missing]    0
Oklahoma          [missing]        [missing]    [missing]